Key Takeaways

  1. Two-stage retrieval is production standard: Fast first-pass (BM25/vectors) + precise reranking; often yields higher accuracy than single-stage.
  2. Hybrid search wins: Combining BM25 + semantic catches both exact and conceptual matches and is widely used in production.
  3. Chunking strategy matters: Structure-aware chunking with metadata dramatically improves retrieval quality. One-size-fits-all fails.
  4. Evaluate retrieval separately from generation: Many RAG failures stem from retrieval issues. Debug component-by-component.
  5. Golden datasets are non-negotiable: You cannot improve what you don’t measure. Build 50-100 test cases minimum.
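The two-stage pattern in takeaway 1 can be sketched in a few lines. Note the scorers here are toy stand-ins, not real BM25 or a real cross-encoder: a cheap lexical overlap for the first pass and a deliberately simple "rerank" heuristic, just to show the candidate-pool flow.

```python
# Two-stage retrieval sketch: cheap first-pass over the whole corpus,
# then an expensive rerank over only the top candidates.
# Both scorers below are illustrative stand-ins, not production models.

def first_pass_score(query: str, doc: str) -> float:
    """Cheap lexical overlap, standing in for BM25 or ANN vector search."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def rerank_score(query: str, doc: str) -> float:
    """Toy 'cross-encoder': overlap score plus a mild brevity bonus."""
    return first_pass_score(query, doc) + 1.0 / (1 + len(doc.split()))

def retrieve(query: str, corpus: list[str], pool: int = 3, k: int = 1) -> list[str]:
    # Stage 1: score everything cheaply, keep a small candidate pool.
    candidates = sorted(corpus, key=lambda d: first_pass_score(query, d),
                        reverse=True)[:pool]
    # Stage 2: rerank only the pool with the expensive scorer.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k]

corpus = [
    "reranking improves retrieval precision",
    "vector search enables semantic retrieval at scale",
    "retrieval precision matters",
    "unrelated document about cooking pasta",
]
print(retrieve("retrieval precision", corpus, pool=3, k=2))
```

In a real pipeline the first pass returns, say, 50–100 candidates from BM25 or a vector index, and the reranker is a cross-encoder scoring each (query, document) pair.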
Production Checklist: Before deploying RAG to production, ensure:
  • Two-stage retrieval implemented (first-pass + reranking)
  • Appropriate chunking strategy for your document types
  • Metadata extraction for filtering and context
  • Hybrid search (BM25 + semantic) or justified single-strategy
  • Golden evaluation dataset (50+ test cases)
  • Automated evaluation pipeline (CI/CD integration)
  • Component-level monitoring (retrieval and generation metrics)
  • Error handling for empty results and low-confidence answers
  • Observability (logging queries, retrieved docs, answers)
  • Cost optimization (caching, efficient embedding models)
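The chunking and metadata items on the checklist can be combined: split on document structure (here, markdown headings) and attach the heading as metadata on each chunk. A minimal sketch assuming markdown input; a real pipeline would also handle nested headings, tables, and token-length limits.

```python
# Structure-aware chunking sketch: split markdown on headings and attach
# the heading as metadata, instead of fixed-size chunks that cut across
# section boundaries.

def chunk_markdown(text: str) -> list[dict]:
    chunks, heading, lines = [], "", []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"heading": heading, "text": body})

    for line in text.splitlines():
        if line.startswith("#"):
            flush()  # close out the previous section
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

doc = """# Setup
Install the package.

# Usage
Call the API with your key."""
for c in chunk_markdown(doc):
    print(c["heading"], "->", c["text"])
```

The heading metadata then serves double duty: it can be embedded alongside the chunk text for context, and stored as a filterable field in the vector store.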

Common Pitfalls Recap

  • Single-stage retrieval: Sacrifices either speed or accuracy
  • Ignoring metadata: Forgoes substantial accuracy gains from filtering and added context
  • No evaluation: “Vibe checks” fail in production
  • Uniform chunking: One strategy doesn’t fit all document types
  • Skipping OCR preprocessing: Poor quality in, poor results out
  • No golden dataset: Cannot measure improvement or regression
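The last two pitfalls are cheap to avoid: a golden dataset is just (query, relevant-doc-id) pairs, and two standard metrics, recall@k and MRR, are enough to start measuring retrieval separately from generation. A sketch with illustrative data and a stubbed retriever:

```python
# Retrieval-only evaluation sketch: score a retriever's ranked output
# against a golden dataset of (query, relevant_doc_id) pairs.
# The golden pairs and ranked runs below are illustrative stubs.

def recall_at_k(ranked: list[str], relevant: str, k: int) -> float:
    return 1.0 if relevant in ranked[:k] else 0.0

def mrr(ranked: list[str], relevant: str) -> float:
    """Reciprocal rank of the first (and here only) relevant doc."""
    return 1.0 / (ranked.index(relevant) + 1) if relevant in ranked else 0.0

golden = [  # (query, id of the doc a correct answer needs)
    ("reset password", "doc_auth"),
    ("invoice history", "doc_billing"),
]
# Ranked doc ids your retriever returned per query (stubbed here).
runs = {"reset password": ["doc_auth", "doc_faq"],
        "invoice history": ["doc_faq", "doc_billing", "doc_auth"]}

r3 = sum(recall_at_k(runs[q], d, 3) for q, d in golden) / len(golden)
m = sum(mrr(runs[q], d) for q, d in golden) / len(golden)
print(f"recall@3={r3:.2f}  MRR={m:.2f}")
```

Run this on every index or chunking change (ideally in CI) and regressions surface as metric drops instead of user complaints.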
What’s new:
  • Unified RAG evaluation with side-by-side retrieval, reranking, and end-to-end answer grading using human and LLM feedback, with rerank list inspection UIs. (RankArena, 2025)
  • Modular retrieval+rerank toolkits simplify A/B testing across BM25, dense, hybrid, and cross-encoders via plug-and-play configs and ablations. (Rankify, 2025)
  • “Hybrid by default” guidance: combine sparse+dense first-pass with cross-encoder rerank; emphasize score fusion and pipeline parallelism. (2025)
  • Practitioner playbooks stress “fast recall → precise rerank” to reduce top-k noise without large cost/latency increases. (2025)
  • Serving optimization work proposes practical knobs (candidate pool sizes, batching, parallel fusion) for latency/cost control. (RAGO, 2025)
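The score-fusion step in the "hybrid by default" guidance is often implemented as Reciprocal Rank Fusion (RRF), which merges rankings by position rather than raw score, so sparse and dense scores never need calibrating against each other. A minimal sketch with illustrative rankings; k=60 is the conventional RRF constant.

```python
# Reciprocal Rank Fusion (RRF) sketch: merge a sparse (BM25-style)
# ranking and a dense (vector) ranking by rank position alone.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in rankings:
        for pos, doc_id in enumerate(ranked):
            # Each list contributes 1/(k + rank); docs ranked high in
            # either list accumulate the most credit.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["d1", "d3", "d2"]   # e.g. BM25 order (illustrative)
dense  = ["d2", "d1", "d4"]   # e.g. vector-search order (illustrative)
print(rrf([sparse, dense]))
```

The fused list then feeds the cross-encoder reranker, closing the "fast recall → precise rerank" loop described above.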
Vendor-specific (awareness):
  • Milvus: open-source vector DB focused on scale; GPU-accelerated indexing and billion-scale search for low-latency RAG.
  • Qdrant: Rust-based vector engine with strong filtering; emphasizes high performance and hybrid/rerank integrations.
  • pgvector: PostgreSQL extension enabling vector search and hybrid (SQL + vectors) inside existing relational stacks.
  • Pinecone: managed vector DB; guidance centers on hybrid search, namespaces, and rerank integration for production RAG.
  • Weaviate: vector DB with built-in hybrid (sparse+dense) and modules for rerankers.
  • Elasticsearch/OpenSearch: BM25 + kNN hybrid patterns via dense vectors alongside mature keyword filters.
  • LlamaIndex: LlamaCloud + LlamaParse for managed parsing/ingestion and advanced PDF/table parsing integrated with LlamaIndex pipelines (public preview).
  • Google Gemini: Gemini API File Search enables file-backed retrieval with managed indexing, chunking, and citation-friendly results for Gemini models.